Interactions Between Weighting Scheme and Similarity Coefficient in Similarity-Based Virtual Screening

Authors

  • John D. Holliday
  • Peter Willett
  • Hua Xiang
Abstract

Similarity searching is one of the most common methods for ligand-based virtual screening, and is normally carried out using the Tanimoto coefficient with binary fingerprints. However, a recent study has suggested that the Tanimoto coefficient may be less appropriate for use with weighted fingerprints in some circumstances. This paper compares the Tanimoto coefficient with other coefficients, and demonstrates that one of these, the cosine coefficient, exhibits a much greater degree of robustness in the face of variations in the nature of the fragment weighting scheme that is being used.

DOI: 10.4018/ijcce.2012070103. International Journal of Chemoinformatics and Chemical Engineering, 2(2), 28-41, July-December 2012. Copyright © 2012, IGI Global.

The effectiveness of similarity searching, i.e., its ability to identify bioactive molecules, depends on the similarity measure, which quantifies the degree of resemblance between the reference structure and each of the database structures. A similarity measure has three components: the descriptors that are used to represent each of the molecules; the weighting scheme that is used to weight different parts of the representation to reflect their relative degrees of importance; and the similarity coefficient that quantifies the degree of resemblance between two weighted sets of descriptors. Although many types of descriptor have been used in similarity searching, by far the best established is the 2D fingerprint, a binary vector in which bits are set to denote the presence of fragment substructures in a molecule (Willett, 2006, 2009). Binary 2D fingerprints are normally used with the Tanimoto coefficient, a simple association coefficient in which the limiting values of zero and unity denote two fingerprints having no bits (and hence no substructures) in common and two identical fingerprints, respectively.
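
As a small illustration of the binary case, the Tanimoto coefficient can be computed as the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below is not taken from the paper: the function name, the representation of a fingerprint as a set of on-bit positions, and the example bit positions are all assumptions made for illustration.

```python
def tanimoto_binary(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient for two binary fingerprints.

    Each fingerprint is given as the set of positions whose bit is set
    (a hypothetical representation chosen for brevity).
    """
    common = len(fp_a & fp_b)                 # bits set in both fingerprints
    union = len(fp_a) + len(fp_b) - common    # bits set in at least one
    # The ratio is undefined for two empty fingerprints; returning 0.0
    # here is a convention assumed for this sketch.
    return common / union if union else 0.0


# Made-up on-bit positions for a reference and a database structure.
reference = {1, 4, 7, 9}
database_molecule = {1, 4, 8}

print(tanimoto_binary(reference, database_molecule))  # 2 common / 5 in union
print(tanimoto_binary(reference, reference))          # identical fingerprints
print(tanimoto_binary({1, 2}, {3, 4}))                # nothing in common
```

The last two calls show the limiting values described above: unity for identical fingerprints and zero when no bits are shared.
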
Many other types of coefficient can be used, but comparative experiments have demonstrated the general effectiveness of the Tanimoto coefficient, and this is the basis for similarity searching facilities in most operational chemoinformatics systems (Leach & Gillet, 2007). There have been many comparisons of fingerprints and similarity coefficients for similarity searching, e.g., the detailed studies by Bender et al. (2009), Hert et al. (2004), Duan et al. (2010), and Sastry et al. (2010).

Despite some limited early work (Willett & Winterman, 1986), there has been less interest in the use of weighted fingerprints, in which the elements of the vector contain not binary values denoting the presence or absence of fragment substructures, but integer or real values denoting the relative importance of the fragments. A fragment with a high weight occurring in both a reference structure and a database structure will then make a greater contribution to the overall degree of inter-molecular similarity than will a fragment in common that has a lesser weight. There are two main sources of frequency information that can be used for fragment weighting: weights based on the number of times that a fragment occurs in an individual molecule; and weights based on the number of times that a fragment occurs in an entire database. Both types of weighting have been studied in recent work by Arif et al. (2009, 2010), who found that the former type of weighting could bring about notable increases in screening effectiveness in some circumstances, but that the latter type was of less general applicability. We hence focus here on the former approach, i.e., on exploiting information on how frequently fragments occur within individual molecules.

Given its widespread usage with binary fingerprints, Arif et al. (2009) used the Tanimoto coefficient in their experiments on frequency-based weighting, but found that problems could arise that were absent when conventional binary fingerprints were being compared. Specifically, they found that even quite small variations in the weighting scheme could affect the magnitudes of the Tanimoto coefficients that are calculated in a similarity search; most notably, they found that if there is a large discrepancy between the weights computed for the reference structure and those computed for the database structure, then screening effectiveness is likely to be markedly lower than if the two sets of weights are of comparable magnitude. This behavior was ascribed to the precise mathematical form of the Tanimoto coefficient, and it was suggested that other types of coefficient might be less affected by changes in the weighting scheme that was being used. Here, we seek to determine whether this is, indeed, the case and whether, accordingly, other coefficients may be preferable to the Tanimoto coefficient when frequency-weighted fingerprints are used for similarity-based virtual screening.
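
The sensitivity to weight magnitude described above can be seen in a toy calculation. The sketch below uses the standard non-binary forms of the two coefficients: Tanimoto as the dot product divided by the sum of the squared norms minus the dot product, and cosine as the dot product divided by the product of the norms. The occurrence counts and the factor-of-ten rescaling are invented for illustration and are not the weighting schemes or data from Arif et al. or from this paper.

```python
import math


def tanimoto_weighted(x, y):
    """Tanimoto coefficient for two weighted (non-binary) fingerprints."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0


def cosine(x, y):
    """Cosine coefficient for two weighted fingerprints."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical occurrence counts of four fragments in one molecule.
reference = [4.0, 1.0, 0.0, 9.0]

# The same molecule as a database structure, weighted on the same scale ...
database_same_scale = list(reference)
# ... and with every weight ten times larger (an assumed, exaggerated
# discrepancy between reference and database weights).
database_scaled = [10.0 * w for w in reference]

print(round(tanimoto_weighted(reference, database_same_scale), 3))
print(round(tanimoto_weighted(reference, database_scaled), 3))
print(round(cosine(reference, database_same_scale), 3))
print(round(cosine(reference, database_scaled), 3))
```

With weights on the same scale, both coefficients give unity for this identical pair. After rescaling, the Tanimoto value falls to 10/91 (about 0.11) even though the fragment profile is unchanged, whereas the cosine value stays at unity because it depends only on the direction of the two vectors. Real weighting schemes need not differ by a simple constant factor, so this shows only the direction of the effect, not its size in an actual search.
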

Similar resources

Analysis and use of fragment-occurrence data in similarity-based virtual screening

Current systems for similarity-based virtual screening use similarity measures in which all the fragments in a fingerprint contribute equally to the calculation of structural similarity. This paper discusses the weighting of fragments on the basis of their frequencies of occurrence in molecules. Extensive experiments with sets of active molecules from the MDL Drug Data Report and the World of M...

New Fragment Weighting Scheme for the Bayesian Inference Network in Ligand-Based Virtual Screening

Many of the conventional similarity methods assume that molecular fragments that do not relate to biological activity carry the same weight as the important ones. One possible approach to this problem is to use the Bayesian inference network (BIN), which models molecules and reference structures as probabilistic inference networks. The relationships between molecules and reference structures in...

Similarity-based approaches to virtual screening.

Current similarity measures for virtual screening are based on the use of molecular fingerprints and the Tanimoto coefficient. This paper describes two ways in which one can increase the effectiveness of similarity-based virtual screening: using similarity coefficients other than the Tanimoto coefficient for the comparison of molecular fingerprints; and using a graph-theoretic similarity measur...

Machine Cell Formation Based on a New Similarity Coefficient

One of the designs of cellular manufacturing systems (CMS) requires that a machine population be partitioned into machine cells. Numerous methods are available for clustering machines into machine cells. One method involves using a similarity coefficient. Similarity coefficients between machines are not absolute, and they still need more attention from researchers. Although there are a number o...

Novel Small Molecules against Two Binding Sites of Wnt2 Protein as potential Drug Candidates for Colorectal Cancer: A Structure Based Virtual Screening Approach

Wnts are the major ligands responsible for activating the Wnt signaling pathway through binding to Frizzled proteins (Fzd) as the receptors. Among these ligands, Wnt2 plays the main role in the tumorigenesis of several human cancers, especially colorectal cancer (CRC). Therefore, it can be considered as a potential drug target. The aim of this study was to identify potential drug candidates ...

Journal title:
  • IJCCE

Volume 2, Issue 

Pages -

Publication date 2012